Data set of a representative online survey on search engines with a focus on search engine optimization (SEO): a cross-sectional study

Representative online surveys are a fruitful approach to gaining a better understanding of user knowledge and perspectives of search engines. In 2020, we conducted an online survey with a sample representative of the German online population aged 16 through 69 (N = 2,012). The online survey included 12 search engine-related sections. The questions cover topics such as usage behavior, self-assessed search engine literacy, trust in search engines, knowledge of ads and search engine optimization (SEO), ability to distinguish ads from organic results, assessments and opinions regarding SEO, and personalization of search results. SEO is the specific focus of the survey, as it was conducted as part of the SEO Effect project, dealing with issues such as the role of SEO from the user perspective. This data set contains complete data from the online survey. The data set allows both further analyses and comparisons with follow-up studies.


Introduction
Representative surveys are suitable for gaining a better understanding of how users interact with search engines, how they understand them, and what opinions they have about them. However, such studies are quite rare and usually refer to individual subareas, such as frequency of use (Beisch & Schäfer, 2020) or trust in search engines (Edelman, 2020), while ignoring other areas, such as paid-search marketing (PSM) and search engine optimization (SEO). To close this gap, we conducted an online survey in 2020 with a sample representative of the German online population. Questions on SEO are the focus of the survey, as it was conducted as part of the SEO Effect project, funded by the German Research Foundation. The overall goal of the project is to describe and explain the role of SEO from the perspective of the participating stakeholder groups, one of them being the users. A total of 999 people participated in the online survey on a large screen (e.g., desktop PC), and 1,013 on a small screen (smartphone). The online survey included several search engine-related sections (Schultheiß et al., 2022). Some of the questions were self-developed and others were adopted from other studies. This data set contains the full data from the online survey.

Materials and methods
We conducted a representative online survey with German internet users. The survey was carried out as part of the SEO Effect project in cooperation with the market research company Fittkau & Maaß Consulting (hereinafter abbreviated as F&M) between March and April 2020. F&M performed the following services, all in consultation with the project team:
• programming of the survey using FileMaker as a database (January 13 - February 27, 2020)
• conducting the survey (March 2 - April 9, 2020)
• data analysis and reporting (April 2020)
The subjects were recruited through the online panel provider respondi, which cooperates with F&M. An online panel is a sample database with a large number of people (often one million or more). These people have agreed to be available as potential respondents in surveys, as long as they meet the selection criteria for the particular study (Callegaro et al., 2014). In the next section, the sample is discussed in detail.

Sampling
We used a sample that is representative of the German online population according to the criteria applied by "Arbeitsgemeinschaft Onlineforschung" (working group online research; AGOF). For sampling, the characteristics age, gender, and state were used. The population includes German internet users from the age of 16 to 69 years. Based on two subsamples to be formed (see below), both of which had to meet the same requirements regarding representativeness, we intended a minimum sample size of N = 2,000 subjects (recommended by F&M) and achieved a sample size of N = 2,012 subjects.
From the total sample, two sub-samples of N = 999 subjects (large screen) and N = 1,013 subjects (small screen) were formed, which meet the same requirements regarding representativeness described above. Sample 1 completed the survey on a large screen (e.g., desktop PC, laptop, tablet; group "large screen"), sample 2 on a smartphone (group "small screen").
To assign the subjects to one of the two groups, the panel provider read the user agent string to determine which device and browser each potential subject was using and assigned the participants accordingly. The assignment was checked twice: respondi checked the devices used by the subjects before forwarding them to the questionnaire, and F&M verified the devices via the user agent string as part of the plausibility check of the data. The subjects were invited to the survey by e-mail. Each participant received 0.75 euros for complete participation. Since we used a sample that is representative of the German online population, we do not assume biases regarding the composition of the sample. However, it should be mentioned that the online survey may also have attracted people who participated solely because of the compensation.
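The device assignment described above can be illustrated with a minimal sketch. The actual rules used by respondi and F&M are not documented here, so the function name classify_device and the substring heuristics below are assumptions chosen for illustration only. The sketch follows the grouping stated in the text: smartphones form the "small screen" group, while desktop PCs, laptops, and tablets form the "large screen" group.

```python
def classify_device(user_agent: str) -> str:
    """Assign a user agent string to 'large screen' or 'small screen'.

    Hypothetical heuristic, not the panel provider's actual procedure.
    """
    ua = user_agent.lower()
    # Tablets count as large screens in the survey design, so check iPad first.
    if "ipad" in ua:
        return "large screen"
    # iPhones and Android devices that announce "Mobile" are treated as smartphones.
    if "iphone" in ua or ("android" in ua and "mobile" in ua):
        return "small screen"
    # Everything else (desktop, laptop, Android tablets without "Mobile").
    return "large screen"


if __name__ == "__main__":
    example = (
        "Mozilla/5.0 (Linux; Android 10; Pixel 3) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/80.0.3987.99 Mobile Safari/537.36"
    )
    print(classify_device(example))
```

Such substring checks are coarse and user agent strings can be misleading, which is one reason a second, independent plausibility check of the devices is useful.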

Questionnaire
First, we developed a catalogue of questions. We derived questions for the survey from the objectives of the "SEO Effect" project, from findings of expert interviews (Schultheiß & Lewandowski, 2021d), and from literature research (in Scopus, we searched for surveys that included "search engine" and "information literacy" or synonyms). After preparing the questions, we sent them to the market research company (F&M). F&M made recommendations regarding the sequence and formulation of the questions as well as suggestions for new questions, which we included.
In several feedback rounds, we jointly created the final version of the questionnaire (see Table 1). In the introduction to the survey, we first welcomed the respondent and thanked him/her for participating. We also pointed out that the questionnaire is used exclusively for research purposes and that by participating, the respondent agrees to the attached privacy policy of F&M.
To give the subjects the opportunity to obtain background information on the survey and to be able to contact the project team, e.g., for feedback purposes, we provided a link to our website at the end of the survey.
The subjects completed 12 sections within the survey, as shown in Table 1. The survey was conducted in German. The translated questionnaire is shown in Table 1. The names of the corresponding variables within the data set and the original questionnaire in German are included in our research data (Schultheiß et al., 2022).

Marking tasks
We created eight SERP screenshots for the marking tasks A-D (each task in variants "large screen" and "small screen"). The screenshots are available as part of the research data (Schultheiß et al., 2021).
SERPs A and B were assigned to block I (simple), SERPs C and D to block II (difficult). Two blocks were created to address a variety of SERP elements and to differentiate between basic and complex SERPs. The structure of the two SERPs per block is identical in terms of the elements on the SERP.
Each participant received two tasks, one from block I and one from block II, as shown in Table 2. The SERP for each task was shown two times. First, all ads were to be marked and second, all organic results.
The screenshots were created using the desktop version of the Chrome browser:
b. Small screen: A browser zoom of 300% resulted in screenshots with a width of 984 px, where the horizontally displayed results (e.g., shopping results) were not cut off/cut in half.
i. Both zoom settings (400%/300%) were also the highest possible settings for the screenshot tool to capture the entire SERPs.
3. Screenshot: The add-on GoFullPage version 7.1 was used to capture full-page SERP screenshots as PNG files. For each query, the first three SERPs were saved to be able to exchange results during later image processing.
4. Image processing: We used GIMP version 2.10.14 (GIMP development team, 2020) (RRID:SCR_003182) to reduce the SERPs to the elements we wanted to investigate (see Table 2). We also matched the small screen SERPs with the large screen SERPs in terms of results and their positions. Otherwise, different selection behavior in the survey might not have been due to the SERP layout (large vs. small screen), but to partially different results (positions):
a. Large screen:
i. The large screen SERPs were reduced to the elements required in the survey, i.e., without "related searches" and "people also ask".
ii. Due to the specifications of F&M, the final large screen SERPs were reduced to a width of 800 px.
b. Small screen:
i. The results of the small screen SERPs as well as their positions were aligned with the large screen SERPs. Consequently, the large and small screen SERPs for a query only differed in terms of layout, but not in terms of results and their positions.
ii. Due to the specifications of F&M, the final small screen SERPs were reduced to a width of 360 px.
Flowchart
Figure 1 shows the flowchart of the online survey.

Pre-test
Before the survey was conducted, pre-tests were carried out in February 2020 by the members and student assistants of the research group (N = 7) and by the panel provider. This enabled us to test whether problems arose, e.g., regarding comprehensibility, and to eliminate them beforehand.
In the pre-test, problems arose regarding the plausibility of the questionnaire, which needed to be fixed before launching the survey. The panel provider checked the survey internally with colleagues to ensure that it was coherent and comprehensible. The duration of the survey was also checked. The maximum duration of 15 minutes as recommended by F&M was met in the pre-tests. Suggestions of the pre-test subjects were also incorporated. These concerned some minor aspects, such as the visual highlighting of relevant parts of a question (e.g., "Are there any search results on this page that can be influenced by search engine optimization?"). After the pre-test, the soft launch started, in which the responses of those subjects who completed the survey first were carefully analyzed. Since the soft launch was successful, the survey could start as planned and the data of the soft launch subjects could also be included in the analysis.

Ethical approval
Due to the design of the research, we consider the study to be of very low risk for participants. Accordingly, we did not obtain ethical approval. The market research company (F&M), which carried out the survey in cooperation with us, operates according to the principles of the UN Global Compact. This means that F&M operates in a way that fulfils fundamental values regarding human rights, labour, environment, and anti-corruption. Written consent to process their data was obtained from all participants. When registering with the online panel provider respondi, participants agreed to the use of their data. For those participants who were minors (16 and 17 years old), parental consent was not required, since "the processing of the personal data of a child shall be lawful where the child is at least 16 years old" (see Article 8 EU GDPR). Data were analysed anonymously. We had no direct contact with the subjects.
Processing of the data
Coding and grouping
Table 3 lists the open-ended questions and the coding specifications. The answers to the knowledge questions were only differentiated into "correct", "partly correct", and "incorrect", since no specifications were made regarding the number of elements (e.g., SEO techniques; question no. 8.3) to be mentioned. The coding of the open-ended questions was done by one coder, which we considered adequate because the coding did not leave any significant room for interpretation.
Table 4 shows how the topics from professional activity, training, and studies have been grouped in terms of SEO affinity (low, average, high). To group the topics, we examined module handbooks of the studies for intersections with the SEO topic. In the case of training and professional activity, e.g., pedagogy, we examined corresponding studies, e.g., educational science.
Table 3. Open-ended questions and coding specifications.
- Sustainable/social: e.g., "they plant trees"
- Privacy
- Technical advantages: e.g., "easy to use with keywords"
- Quality: e.g., "more results than other search engines"
- Habit
- Against Google: e.g., "I think Google is too powerful"
- Pro Google: e.g., "I like that Google pays attention to its users"
6.1 When it comes to the search results displayed on Google: What do you think influences the ranking of search results on Google?
- Payment
- Algorithm
- Query of the searcher: e.g., "order of terms"
- Tools for website optimization
- Traffic/ranking of the website: e.g., "number of clicks"
- User behavior: e.g., "search history"
- User's Google profile: e.g., "my personal data"
- Topicality/quality/seriousness of the website: e.g., "quality and relevance criteria in terms of content and technology"
- Google's self-interests
- Other: e.g., "No idea. Google gives little information on this"
7.1 What do you think: Where does Google generate most of its revenue from?
- Correct: "ads" or terms having the same meaning (e.g., advertisement, sponsored results, search engine advertising, SEA, paid search marketing)
- Partly correct: correct term (e.g., ads) and at least one incorrect term
- Incorrect: clearly incorrect terms (e.g., data sale, donations)
7.4 And how do the paid search results on Google differ from the other results that have not been paid for?
- Correct: "ad label" or terms having the same meaning (e.g., ad, ad term, label, marking), with or without mentioning the separate position of the ads
- Partly correct: correct term (e.g., ad label) and at least one incorrect term
- Unclear: only position named as characteristic (e.g., "always the top results")
- Incorrect: clearly incorrect terms (e.g., different font)
8.2 Do you know what term is used to describe these measures to improve the ranking in the Google search results list (without payment to Google)?
- Correct: "search engine optimization" or terms having the same meaning (e.g., SEO)
- Partly correct: correct term (e.g., SEO) and at least one incorrect term
- Incorrect: clearly incorrect terms (e.g., ads, bots)
8.3 And by what means can a website be designed or programmed so that it is ranked higher in the Google search results lists?
- Correct: "keywords" or other correct SEO techniques
- Partly correct: correct term (e.g., keywords) and at least one incorrect term; or only "SEO"
- Incorrect: clearly incorrect SEO techniques (e.g., payment, ads)
10.3 Which positive effects does search engine optimization have in your opinion?
- Better/more relevant results: e.g., "best result on position 1"
- Quicker retrieval: e.g., "you find what you're looking for faster"
- Advantages for the searcher such as individualization, filters: e.g., "the search engine knows me"
- Advantages for website operators: e.g., "optimized pages receive more clicks"
- Other: e.g., "correction of spelling mistakes"
10.4 Which negative effects does search engine optimization have in your opinion?
- Negative influence on results quality: e.g., "first result not always the best"
- (Conscious) influence, manipulation of the results with negative background: e.g., "no objective results"
- Displacement of the actually searched, desired, suitable search results: e.g., "commerce and profit comes before truth"
- Discrimination against smaller websites/providers: e.g., "distortion of information in favor of solvent website providers"
- Other: e.g., "you have to pay attention"
SEO: Search engine optimization, SEA: Search engine advertising

Success rates for marking tasks
Table 5 shows the search results to be marked on the SERPs according to the task, device, and area (SEO or PSM).
Based on the marked elements, a success rate was calculated for each participant per task (A-D), device (large, small), and area (SEO, PSM). This rate accounts for correctly marked (true positive) and incorrectly marked (false positive) results using the formula $\frac{n_{\text{true}} - n_{\text{false}}}{n_{\text{to be marked}}}$.
Two examples follow, the first for achieving a positive success rate for task A, large screen, SEO results. In this case, 10 organic results are to be marked, of which the subject marks 8 results (8 true). In addition, the subject incorrectly marks 2 ads (2 false). This results in a success rate of 0.6. Negative success rates are also possible if a subject makes more incorrect than correct markings, exemplified by task B, small screen, PSM results. In this case, a total of 4 text ads are to be marked. If a subject identifies all 4 text ads (true), but additionally marks 6 organic results (false), the subject achieves a success rate of -0.5.
For the calculation of the success rates and the corresponding variables of the data set, see Appendix 1: Calculation of success rates.
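The success rate described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the calculation script behind the data set; the function name success_rate and its parameter names are assumptions introduced here, and the two calls reproduce the worked examples from the text (task A, large screen, SEO results; task B, small screen, PSM results).

```python
def success_rate(n_true: int, n_false: int, n_to_be_marked: int) -> float:
    """Return (n_true - n_false) / n_to_be_marked for one marking task.

    n_true: correctly marked results (true positives)
    n_false: incorrectly marked results (false positives)
    n_to_be_marked: number of results that should have been marked
    """
    if n_to_be_marked <= 0:
        raise ValueError("n_to_be_marked must be positive")
    return (n_true - n_false) / n_to_be_marked


if __name__ == "__main__":
    # Task A, large screen, SEO: 8 of 10 organic results marked, plus 2 ads.
    print(success_rate(n_true=8, n_false=2, n_to_be_marked=10))
    # Task B, small screen, PSM: all 4 text ads marked, plus 6 organic results.
    print(success_rate(n_true=4, n_false=6, n_to_be_marked=4))
```

Because false positives are subtracted, the rate is not bounded below by zero: a subject who marks many wrong results can score below -1 when the number of results to be marked is small.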

Lluís Codina
Universitat Pompeu Fabra, Barcelona, Spain
This paper presents in detail a complete methodology to carry out online surveys among German search engine users. In addition to the explanation of the methodology, and its various steps, it includes theoretical foundations. The way to proceed step by step and the precautions used for the validity of the work are explained in detail. The article also presents the complete tables and schemes used for the survey, which provides a high value to this work.
The rationale presented for the creation of the data set is clearly indicated. The research it supports is necessary and timely as it fills in clearly identified research gaps. In addition, the data set is rich and conceptually very powerful with clear possibilities to support ambitious investigations.
On the other hand, the protocols used are described exhaustively and with sufficient technical and conceptual details. The article points out that an ethical approval was not required, but they convincingly explain the reasons for it, and the guarantees adopted by both the researchers and the contracted company.
For the above reasons, the dataset is traceable, which is why other researchers can analyze and replicate the research.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Search Engine Optimization, Digital News Media
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
involved, ALL university research must go through an ethical approval process." We agree that, although we did not consider it necessary, it would have been better to obtain ethical approval for our study. For future studies, we will therefore obtain ethical approval for all studies without exception.

"The amount of self-citation in this article is extreme. 8/18 of the references (that is about 44%) are the authors themselves! This is unacceptable in the academic world." Thank you for the criticism, which we would agree with in the case of regular research articles. However, since our article is a data note, we have a different view on self-citation, which we are glad to clarify. As specified by F1000Research, the focus of data notes is a comprehensive description of the data, but not an extensive introduction or extensive literature sections. On the one hand, this justifies the low number of cited references. On the other hand, it also explains the high self-citation rate, since the online survey methodology described in the data note is directly related to our preliminary work.

"No results are given in the abstract. That is standard practice, and some indication of the results must be given in an abstract." Since the article is a data note, it does not include any results, which is why the abstract does not include any results either. This is in line with the F1000Research guidelines for data notes: "Data Notes are brief descriptions of datasets that promote the potential reuse of research data and include details of why and how the data were created; they do not include any analyses or conclusions." (see https://f1000research.com/for-authors/article-guidelines)

3. "It is not clear what Table 1 is supposed to present." Table 1 shows the questionnaire. It contains the questions the subjects were asked, the response options, and the sources of the respective questions. Could you please elaborate on what is unclear about the table from your perspective?

4. "The reference to Table 1 on page 4 is wrong. There is no translated questionnaire in Table 1." Since the study was conducted in German, the English version of the questionnaire shown in Table 1 is the translated questionnaire. Thus, we have not made any changes.

If the answer to this question was "yes," the next question was 7.3, which was also a closed question. If the answer to this question was again "yes," the next question was open-ended question 7.4. Then, question 8.1 followed for all participants.